Traffic Accident Analysis : Identifying Patterns and Underlying Road Conditions¶

Goal :
The objective of this project is to identify and analyze key factors contributing to road traffic accidents, with the aim of understanding patterns related to road conditions, weather, and time of the day.

About Dataset :
The dataset comes from Kaggle and contains two tables: the raw RTA data before preprocessing and a cleaned (preprocessed) version. We'll use the cleaned data in our analysis. It consists of manual records of road traffic accidents in Addis Ababa City, Ethiopia, for the years 2017 - 20, collected from the sub-city police departments for master's research work.
Click here to check out the dataset : Traffic Data

Importing Libraries¶

In [1]:
import pandas as pd
import plotly.express as px

Importing Dataset¶

In [2]:
df = pd.read_csv("C:/Users/obalabi adepoju/Documents/traffic.csv")

Data Inspection¶

We'll look at a general overview of our data and a description of each column.

In [3]:
df.head(10)
Out[3]:
Age_band_of_driver Sex_of_driver Educational_level Vehicle_driver_relation Driving_experience Lanes_or_Medians Types_of_Junction Road_surface_type Light_conditions Weather_conditions Type_of_collision Vehicle_movement Pedestrian_movement Cause_of_accident Accident_severity
0 18-30 Male Above high school Employee 1-2yr Unknown No junction Asphalt roads Daylight Normal Collision with roadside-parked vehicles Going straight Not a Pedestrian Moving Backward 2
1 31-50 Male Junior high school Employee Above 10yr Undivided Two way No junction Asphalt roads Daylight Normal Vehicle with vehicle collision Going straight Not a Pedestrian Overtaking 2
2 18-30 Male Junior high school Employee 1-2yr other No junction Asphalt roads Daylight Normal Collision with roadside objects Going straight Not a Pedestrian Changing lane to the left 1
3 18-30 Male Junior high school Employee 5-10yr other Y Shape Earth roads Darkness - lights lit Normal Vehicle with vehicle collision Going straight Not a Pedestrian Changing lane to the right 2
4 18-30 Male Junior high school Employee 2-5yr other Y Shape Asphalt roads Darkness - lights lit Normal Vehicle with vehicle collision Going straight Not a Pedestrian Overtaking 2
5 31-50 Male Unknown Unknown Unknown Unknown Y Shape Unknown Daylight Normal Vehicle with vehicle collision U-Turn Not a Pedestrian Overloading 2
6 18-30 Male Junior high school Employee 2-5yr Undivided Two way Crossing Unknown Daylight Normal Vehicle with vehicle collision Moving Backward Not a Pedestrian Other 2
7 18-30 Male Junior high school Employee 2-5yr other Y Shape Asphalt roads Daylight Normal Vehicle with vehicle collision U-Turn Not a Pedestrian No priority to vehicle 2
8 18-30 Male Junior high school Employee Above 10yr other Y Shape Earth roads Daylight Normal Collision with roadside-parked vehicles Going straight Crossing from driver's nearside Changing lane to the right 2
9 18-30 Male Junior high school Employee 1-2yr Undivided Two way Y Shape Asphalt roads Daylight Normal Collision with roadside-parked vehicles U-Turn Not a Pedestrian Moving Backward 1
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Age_band_of_driver       12316 non-null  object
 1   Sex_of_driver            12316 non-null  object
 2   Educational_level        12316 non-null  object
 3   Vehicle_driver_relation  12316 non-null  object
 4   Driving_experience       12316 non-null  object
 5   Lanes_or_Medians         12316 non-null  object
 6   Types_of_Junction        12316 non-null  object
 7   Road_surface_type        12316 non-null  object
 8   Light_conditions         12316 non-null  object
 9   Weather_conditions       12316 non-null  object
 10  Type_of_collision        12316 non-null  object
 11  Vehicle_movement         12316 non-null  object
 12  Pedestrian_movement      12316 non-null  object
 13  Cause_of_accident        12316 non-null  object
 14  Accident_severity        12316 non-null  int64 
dtypes: int64(1), object(14)
memory usage: 1.4+ MB
In [5]:
print(f"This dataset contains {df.shape[0]} rows and {df.shape[1]} columns")
This dataset contains 12316 rows and 15 columns

Here are the descriptions of each column in the dataset:

  1. Age_band_of_driver: The age group of the driver involved in the accident (e.g., 18-30, 31-50).

  2. Sex_of_driver: The gender of the driver involved in the accident (e.g., Male, Female).

  3. Educational_level: The highest level of education attained by the driver (e.g., Above high school, Junior high school).

  4. Vehicle_driver_relation: The relationship between the driver and the vehicle (e.g., Employee, Owner).

  5. Driving_experience: The number of years the driver has been driving (e.g., 1-2yr, Above 10yr).

  6. Lanes_or_Medians: The type of road or median where the accident occurred (e.g., Undivided Two way, other).

  7. Types_of_Junction: The type of junction where the accident took place (e.g., No junction, Y Shape).

  8. Road_surface_type: The type of road surface at the accident location (e.g., Asphalt roads, Earth roads).

  9. Light_conditions: The lighting conditions at the time of the accident (e.g., Daylight, Darkness - lights lit).

  10. Weather_conditions: The weather conditions during the accident (e.g., Normal, Rainy).

  11. Type_of_collision: The nature of the collision (e.g., Collision with roadside-parked vehicles, Vehicle with vehicle collision).

  12. Vehicle_movement: The movement of the vehicle(s) involved in the accident (e.g., Going straight, Turning).

  13. Pedestrian_movement: The movement of pedestrians involved in the accident, if any (e.g., Not a Pedestrian).

  14. Cause_of_accident: The primary cause or contributing factor of the accident (e.g., Overtaking, Changing lane).

  15. Accident_severity: The severity of the accident, usually categorized by the level of damage or injury (e.g., Slight, Fatal).

In [6]:
#We'll start off with renaming some of our columns for ease when analyzing
df.rename(columns={'Age_band_of_driver':"age",'Sex_of_driver':"sex",'Educational_level':"education",
                   'Vehicle_driver_relation':"relation",'Driving_experience':"experience",'Lanes_or_Medians':"L/M",
                   'Road_surface_type':"surface",'Types_of_Junction':"tjunction",'Light_conditions':"light",
                   'Weather_conditions':"weather",'Type_of_collision':"tcollision",'Vehicle_movement':"vmovement",
                   'Pedestrian_movement':"pmovement",'Cause_of_accident':'cause','Accident_severity':'Severity'},inplace = True)

Age¶

We'll be going through the columns to gain an even deeper understanding of our dataset.
We'll be starting with our Age column.

In [7]:
#I'll replace each range with a category to make things easier

#so let's check out our ranges
df['age'].value_counts()
Out[7]:
age
18-30       4271
31-50       4087
Over 51     1585
Unknown     1548
Under 18     825
Name: count, dtype: int64
In [8]:
# With that we can now map our data to its categories 
df.age = df.age.replace({'18-30' : "Young Adults",
                    '31-50' : "Older Adults",
                    'Over 51' : 'Elderly',
                    "Under 18" : 'Child'})
In [9]:
fig = px.histogram(df, x = 'age', title = 'Age Distribution', color_discrete_sequence = ['blue'])

fig.show()
  • Adults (the 18-30 and 31-50 bands) account for the majority of drivers involved in accidents, while drivers under 18 account for the smallest share. This likely reflects the general demographics of road users.

Sex¶

In [10]:
df.sex.value_counts()
Out[10]:
sex
Male       11437
Female       701
Unknown      178
Name: count, dtype: int64
In [11]:
''' Because there are so few unknowns and the male category is so dominant,
we replace the unknowns with the mode (Male) '''

# mode() returns a Series, so take its first element as the replacement value
df['sex'] = df['sex'].replace('Unknown', df['sex'].mode()[0])
In [12]:
fig = px.histogram(df, x = 'sex', title = 'Gender Distribution',
                   color = 'sex',color_discrete_map = {'Male':'blue','Female':'grey'})
fig.show()
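
As a small illustration of the mode-based replacement used for the sex column, the sketch below applies the same idea to a toy Series. The values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data, purely illustrative (not from the dataset)
sex = pd.Series(["Male", "Male", "Female", "Unknown", "Male"])

# mode() returns a Series (there can be ties), so take its first element
most_common = sex.mode()[0]

# Replace the placeholder with the most frequent category
sex_filled = sex.replace("Unknown", most_common)
print(sex_filled.value_counts())
```

This approach is only reasonable when the placeholder is rare and one category clearly dominates, as is the case for this column.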

Educational Level¶

In [13]:
df.education.unique()
Out[13]:
array(['Above high school', 'Junior high school', 'Unknown',
       'Elementary school', 'High school', 'Illiterate',
       'Writing & reading'], dtype=object)
In [14]:
df.education.value_counts()
Out[14]:
education
Junior high school    7619
Elementary school     2163
High school           1110
Unknown                841
Above high school      362
Writing & reading      176
Illiterate              45
Name: count, dtype: int64
Interestingly, most individuals involved in road accidents had completed only junior high school, a group that alone accounts for over 60% of our data. The three largest groups (junior high school, elementary school, and high school) all hold a high school certificate or less and together make up about 88.4% of the data. This may suggest that drivers without higher education are more prone to accidents, perhaps because of gaps in basic knowledge of roads and driving, although it may equally reflect the educational makeup of the driving population.
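
The percentages quoted for the education column can be checked with a few lines of arithmetic. This is a minimal sketch that re-enters the counts from the output above by hand; the variable names are chosen for the example.

```python
import pandas as pd

# Counts copied from the value_counts() output above
education_counts = pd.Series({
    "Junior high school": 7619,
    "Elementary school": 2163,
    "High school": 1110,
    "Unknown": 841,
    "Above high school": 362,
    "Writing & reading": 176,
    "Illiterate": 45,
})

# Share of each category, and the combined share of the three largest
shares = education_counts / education_counts.sum()
top3_share = shares.nlargest(3).sum()

print(shares.round(3))
print(f"Top 3 categories: {top3_share:.1%}")
```

On the notebook's DataFrame, `df['education'].value_counts(normalize=True)` should give the same shares directly.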

Relation¶

In [15]:
df.relation.value_counts()
Out[15]:
relation
Employee    9627
Owner       1973
Unknown      593
Other        123
Name: count, dtype: int64

We've gone through a few of our columns and we see that unknowns are unavoidable.

  • The data shows that the majority of drivers involved in accidents, approximately 78%, are employees driving a vehicle they do not own. One possible reading is that drivers who are not personally responsible for the cost of an accident may be more prone to risky behavior, which raises the possibility that a lack of personal ownership contributes to carelessness. This dataset alone cannot confirm that.

Experience¶

In [16]:
df.experience.value_counts()
Out[16]:
experience
5-10yr        3363
2-5yr         2613
Above 10yr    2262
1-2yr         1756
Below 1yr     1342
Unknown        829
No Licence     118
unknown         33
Name: count, dtype: int64
In [17]:
# 'No Licence' is left as its own category; only the lowercase 'unknown' is merged into 'Unknown'
df.experience = df.experience.replace({'5-10yr':4,'2-5yr':3,'Above 10yr':5,'1-2yr':2,
                                   'Below 1yr':1,'unknown':'Unknown'})
In [18]:
fig = px.violin(df, x = 'experience', title = 'Driving Experience Distribution', color_discrete_sequence = ['blue'])

fig.show()
  • Our plot shows that most of our data falls within ranks 2 to 4, meaning most drivers have between 1 and 10 years of experience. Drivers with 5 to 10 years of experience (rank 4) are the largest group. The counts rise steadily from rank 1 to rank 4 and then drop for drivers with more than 10 years of experience (rank 5).
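
One alternative to numeric ranks, sketched here as an assumption rather than what this notebook does, is an ordered categorical. It keeps the original labels while still giving them a sort order. The toy values and the `levels` list are illustrative.

```python
import pandas as pd

# Experience bands from shortest to longest (order chosen for this sketch)
levels = ["Below 1yr", "1-2yr", "2-5yr", "5-10yr", "Above 10yr"]

# Toy values, not the real column
experience = pd.Series(["5-10yr", "1-2yr", "Above 10yr", "Below 1yr", "2-5yr", "5-10yr"])

# An ordered categorical keeps the labels but sorts and compares by rank
experience_cat = pd.Series(pd.Categorical(experience, categories=levels, ordered=True))

# sort_index() orders the counts by category rank rather than by frequency
counts = experience_cat.value_counts().sort_index()
print(counts)
print(experience_cat.min(), "to", experience_cat.max())
```

With this representation there is no need to map ranks back to labels before building a crosstab later on.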

Lanes / Median¶

In [19]:
df['L/M'].value_counts()
Out[19]:
L/M
Two-way (divided with broken lines road marking)    4411
Undivided Two way                                   3796
other                                               1660
Double carriageway (median)                         1020
One way                                              845
Unknown                                              442
Two-way (divided with solid lines road marking)      142
Name: count, dtype: int64

A few observations on our categories.

  • We observe some expected trends here: two-way roads divided only by broken (dashed) line markings are the most common locations for accidents, followed by undivided two-way roads. Neither has a physical barrier between opposing traffic, so drivers can easily stray into the wrong lane. It's no surprise that these two categories account for about 66% of our data.
  • On the other hand, one-way roads and two-way roads divided by solid line markings have the lowest accident counts. Double carriageways still show a notable number of accidents (1,020), possibly due to high speeds or collisions with lane dividers.

Type of Junction¶

In [20]:
df['tjunction'].value_counts()
Out[20]:
tjunction
Y Shape        4543
No junction    3837
Crossing       2177
Unknown        1078
Other           445
O Shape         164
T Shape          60
X Shape          12
Name: count, dtype: int64
In [21]:
#Because crossing and X junctions typically have the same definition, we'll add them together
df['tjunction'] = df['tjunction'].replace({'X Shape': 'Crossing'})

For those not very familiar with road terminology, here is a short description of the types of junctions in the dataset:

  1. No Junction: The accident occurred on a road section without any junction or intersection.

  2. Y Shape: A junction where one road splits into two, forming a "Y" shape.

  3. Crossing: An intersection where roads cross each other, typically at right angles (similar to an "X" shape).

  4. O Shape: A circular junction, likely a roundabout, where traffic moves in one direction around a central island.

  5. T Shape: A "T" junction where one road ends at a perpendicular intersection with another road.

  1. Considering our data with regard to junctions, most accidents occur at Y-shaped junctions. Here are a few possible reasons why:
    • Multiple Points of Conflict: A Y-junction has multiple points where vehicles can intersect. Unlike a straight road or a simple T-junction, vehicles at a Y-junction may approach from different angles, creating more opportunities for collisions.
    • Visibility Issues: The angles at Y-junctions can sometimes create blind spots or reduce visibility for drivers, making it harder to see oncoming traffic, especially when turning.
    • Turning Movements: At a Y-junction, vehicles often need to make sharp turns, either merging into or crossing traffic. These maneuvers increase the risk of accidents, particularly if drivers misjudge the speed or distance of oncoming vehicles.

These are just some of the reasons Y junctions are the most frequent in our data.

  2. The next most common accidents occur on roads without junctions. Among accidents that do involve junctions, crossing junctions are the second most frequent, while T-junctions are involved in a smaller number of accidents.
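
To see what merging X Shape into Crossing does to the counts, the following sketch works from the printed value counts instead of the DataFrame. The counts are copied from the output above, and relabelling the index is just one way to do it, chosen for this example.

```python
import pandas as pd

# Counts copied from the value_counts() output above
junction_counts = pd.Series({
    "Y Shape": 4543, "No junction": 3837, "Crossing": 2177, "Unknown": 1078,
    "Other": 445, "O Shape": 164, "T Shape": 60, "X Shape": 12,
})

# Relabel X Shape as Crossing, then add up the rows that now share a label
merged = (
    junction_counts.rename(index={"X Shape": "Crossing"})
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)

print(merged)
print(f"Y Shape share: {merged['Y Shape'] / merged.sum():.1%}")
```

The merge only moves 12 records, so the ranking of junction types is unchanged.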

Light Conditions¶

In [22]:
df['light'].value_counts()
Out[22]:
light
Daylight                   8798
Darkness - lights lit      3286
Darkness - no lighting      192
Darkness - lights unlit      40
Name: count, dtype: int64
In [23]:
# Let's check out these values in a bar chart

fig = px.histogram(df, x='light', title='Distribution of Light Conditions')

# Update the bar color and outline
fig.update_traces(marker_color='mediumseagreen',marker_line_color='black',marker_line_width=0.5)

# Update the title font
fig.update_layout(
    title_font=dict(size=20, color='black')  
)

# Show the figure
fig.show()
I would have hypothesized that less light leads to more accidents, but that ignores the fact that these light categories correspond to different times of day. There tend to be fewer vehicles on the road after dark, with traffic peaking during the daytime, which may explain why the number of accidents falls as light decreases.

Weather Conditions¶

In [24]:
df['weather'].value_counts()
Out[24]:
weather
Normal               10063
Raining               1331
Other                  296
Unknown                292
Cloudy                 125
Windy                   98
Snow                    61
Raining and Windy       40
Fog or mist             10
Name: count, dtype: int64
In [25]:
#Due to the considerable difference between the values, we'll replace our unknowns with the mode
df['weather'] = df['weather'].replace({'Unknown':'Normal'})
  • More than 80% of accidents occurred in normal weather, so weather does not appear to be a major factor in most accidents. Rain is the next most common condition, and its lead over the remaining conditions suggests that bad weather contributes to accidents, if only slightly.

  • Another thing to note is the very low counts for certain conditions. This is not surprising: few people choose to drive in snow or in a storm, which is worse than ordinary rain, and fog or mist typically occurs in the very early morning, when fewer cars are on the road.

Vehicle Movement¶

In [26]:
df.vmovement.value_counts()
Out[26]:
vmovement
Going straight         8158
Moving Backward         985
Other                   937
Reversing               563
Turnover                489
Unknown                 396
Getting off             339
Entering a junction     193
Overtaking               96
Stopping                 61
U-Turn                   50
Waiting to go            39
Parked                   10
Name: count, dtype: int64
In [27]:
# Since reversing consists of the same motions as moving backwards, we've decided to join these two together
df['vmovement'] = df['vmovement'].replace({'Reversing':'Moving Backward'})
In [28]:
# Let's check out these values in a histogram

fig = px.histogram(df,'vmovement',title = 'Movement Distribution',color = 'vmovement',
                   color_discrete_map = {'Going straight':'blue'},color_discrete_sequence=['gray'])

fig.update_traces(marker_line_color='black',marker_line_width=0.5)

fig.update_layout(
    xaxis_title="Vehicle Movement",   # Specify x-axis title
    yaxis_title="",   # Leave the y-axis title empty
    showlegend=False,             # Hide legend
    title_font=dict(size=20, color='black')  
)

fig.show()

Type of Collision¶

In [29]:
df.tcollision.value_counts()
Out[29]:
tcollision
Vehicle with vehicle collision             8774
Collision with roadside objects            1786
Collision with pedestrians                  896
Rollover                                    397
Collision with animals                      171
Unknown                                     169
Collision with roadside-parked vehicles      54
Fall from vehicles                           34
Other                                        26
With Train                                    9
Name: count, dtype: int64
In [30]:
# Let's check out its distribution
fig = px.histogram(df,y='tcollision',title = 'Type of Collision',
             category_orders={'tcollision': df['tcollision'].value_counts().index})

fig.update_traces(marker_line_color = 'black')

fig.update_layout(
    xaxis_title="", 
    yaxis_title="Type of Collision", 
    showlegend=False,             # Hide legend
    yaxis=dict(showgrid=False),   # Remove y-axis grid lines
    title_font=dict(size=20, color='black')  
)

fig.show()
The vast majority of collisions involve vehicles colliding with other vehicles (8,774 incidents), followed by collisions with roadside objects (1,786) and, unfortunately, pedestrians (896). Collisions with roadside-parked vehicles (54) are far less frequent, which is consistent with the vehicle movement data above, where parked vehicles were involved in the fewest accidents. They are not entirely absent, though, which is a reminder that reckless driving can affect even stationary vehicles.

Cause of Accidents¶

In [31]:
df.cause.value_counts()
Out[31]:
cause
No distancing                           2263
Changing lane to the right              1808
Changing lane to the left               1473
Driving carelessly                      1402
No priority to vehicle                  1207
Moving Backward                         1137
No priority to pedestrian                721
Other                                    456
Overtaking                               430
Driving under the influence of drugs     340
Driving to the left                      284
Getting off the vehicle improperly       197
Driving at high speed                    174
Overturning                              149
Turnover                                  78
Overspeed                                 61
Overloading                               59
Drunk driving                             27
Unknown                                   25
Improper parking                          25
Name: count, dtype: int64

If you're questioning why overspeeding and high speeds aren't the same, it is because:

  • Driving at high speed: This generally refers to driving at speeds significantly above the average for the road conditions or traffic flow but may not necessarily exceed the legal speed limit. It indicates excessive speed relative to the norm.
  • Overspeeding: This specifically means driving faster than the posted speed limit. It is a more precise term indicating that the speed exceeds legal regulations, regardless of the driving conditions.
Initially, the leading cause of accidents appears to be drivers not leaving enough distance between cars. However, changing lanes to the right and to the left both have significant counts, so taken together, lane changes are the leading cause of accidents in Addis Ababa. Changing to the right is the more frequent of the two, possibly because traffic in Ethiopia drives on the right side of the road, so drivers typically move right to join slower traffic or when preparing to exit or turn right, as in other countries with right-hand traffic.

Next we'll combine these two categories together

In [32]:
df['cause'] = df['cause'].replace({'Changing lane to the left':'Changing lanes','Changing lane to the right':"Changing lanes"})
In [33]:
df['cause'].value_counts()
Out[33]:
cause
Changing lanes                          3281
No distancing                           2263
Driving carelessly                      1402
No priority to vehicle                  1207
Moving Backward                         1137
No priority to pedestrian                721
Other                                    456
Overtaking                               430
Driving under the influence of drugs     340
Driving to the left                      284
Getting off the vehicle improperly       197
Driving at high speed                    174
Overturning                              149
Turnover                                  78
Overspeed                                 61
Overloading                               59
Drunk driving                             27
Unknown                                   25
Improper parking                          25
Name: count, dtype: int64
Now the major cause of accidents is `Changing lanes`, followed by no distancing between vehicles, which can lead to a vehicle-with-vehicle collision if the vehicle in front stops suddenly or the one behind does not brake in time. Another important cause is careless driving. Although it is a category of its own, giving no priority to vehicles or pedestrians also reflects a lack of care, which suggests that an absence of caution is a major factor in accidents. The last notable cause relates to vehicle movement, i.e. `Moving Backward`. Finally, improper parking is once again among the least common categories in our column.
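
To put rough numbers on the point about lack of care, the sketch below adds up the three care-related causes and compares them with lane changes. The grouping in `care_related` is an assumption made for this example, following the reasoning in the paragraph above; the counts are copied from the output.

```python
import pandas as pd

# Counts copied from the value_counts() output above (only the rows used here)
cause_counts = pd.Series({
    "Changing lanes": 3281,
    "No distancing": 2263,
    "Driving carelessly": 1402,
    "No priority to vehicle": 1207,
    "No priority to pedestrian": 721,
})
total_accidents = 12316  # number of rows reported by df.info()

# Grouping chosen for this sketch: causes the text links to a lack of care
care_related = ["Driving carelessly", "No priority to vehicle", "No priority to pedestrian"]
care_share = cause_counts[care_related].sum() / total_accidents
lane_share = cause_counts["Changing lanes"] / total_accidents

print(f"Changing lanes: {lane_share:.1%}")
print(f"Care-related causes combined: {care_share:.1%}")
```

Under this grouping the care-related causes are on a par with lane changes, each at a little over a quarter of all accidents.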

Severity¶

First, we'll convert our data from numeric values to strings. We've gone over the metadata for our dataset, so we know the correct labels to use.

In [34]:
df['Severity'] = df['Severity'].replace({2:'Slight',1:'Serious',0:"Fatal"})
In [35]:
df.Severity.value_counts()
Out[35]:
Severity
Slight     10415
Serious     1743
Fatal        158
Name: count, dtype: int64
In [36]:
# Let's check out these values in a donut chart
fig = px.pie(df,'Severity',title = 'Accident Severity Distribution',color_discrete_sequence=['mediumpurple'],hole=0.5)

fig.update_traces(marker_line_color='black',marker_line_width=0.5)

fig.show()

Exploratory Data Analysis¶

For this next part of our project, we're going to focus on answering the following couple of questions:

  • How is gender distributed across accident severity levels?
  • How does accident severity differ between driver-vehicle relations?
  • Is there a relationship between years of experience and severity?
  • Do light conditions affect the severity of an accident?
  • Which types of collision are the most severe?

Creating Our Function¶

In [37]:
def crossdf(col):
    """
    Return a pandas crosstab for the given column against the target variable
    """
    crossdf = pd.crosstab(df['Severity'], df[col], normalize='index')
    crossdf = crossdf.reset_index()
    return crossdf
In [38]:
def melt(data,col) :
    if 'Unknown' in data :
        data.drop(columns='Unknown',inplace=True)
    df_melted = data.melt(id_vars='Severity', var_name=col, value_name='Proportion')
    return df_melted
In [39]:
def viz(data,col):
    fig = px.histogram(data, x='Severity', y='Proportion', color=col,
                   title='Proportion of Accidents by Severity and '+col,barmode = 'group')
    
    fig.update_traces(marker_line_color='black',marker_line_width=0.5)
    
    # Update the title font, fix the y-axis range to [0, 1], and label the y-axis
    fig.update_layout(
        title_font=dict(size=20, color='black'),
        yaxis = dict(range = [0,1]),
        yaxis_title = 'Proportion'
    )
    fig.show()
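
To make the crosstab and melt steps concrete, here is a minimal sketch on a toy DataFrame. The data and the names `wide_df` and `long_df` are invented for the example.

```python
import pandas as pd

# Toy data, purely illustrative
toy = pd.DataFrame({
    "Severity": ["Slight", "Slight", "Slight", "Serious", "Serious", "Fatal"],
    "light": ["Daylight", "Daylight", "Darkness", "Daylight", "Darkness", "Darkness"],
})

# normalize='index' makes each Severity row sum to 1
wide_df = pd.crosstab(toy["Severity"], toy["light"], normalize="index").reset_index()

# melt turns one column per category into one row per (Severity, category) pair
long_df = wide_df.melt(id_vars="Severity", var_name="Light", value_name="Proportion")

print(wide_df)
print(long_df)
```

Because the normalization is by row, each proportion answers "what share of accidents at this severity fall in this category", not "what share of this category's accidents reach this severity".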

Gender¶

In [40]:
'''We are going to create a cross tab for our different categories for each column,
to make use of their proportion so we can examine our dataset a little more statistically.
'''
df1 = crossdf('sex')
df1
Out[40]:
sex Severity Female Male
0 Fatal 0.031646 0.968354
1 Serious 0.059667 0.940333
2 Slight 0.056841 0.943159
In [41]:
data = melt(df1,'Sex')
data
Out[41]:
Severity Sex Proportion
0 Fatal Female 0.031646
1 Serious Female 0.059667
2 Slight Female 0.056841
3 Fatal Male 0.968354
4 Serious Male 0.940333
5 Slight Male 0.943159
In [42]:
viz(data,'Sex')
  • The table above shows the share of each sex within every severity level. Among slight accidents, the large majority of drivers were male (about 94%), with females making up only about 6%, and the same holds for serious accidents. For fatal accidents, the male share is higher still (about 97%), although fatal accidents are few in number (158), so this difference should be read with caution.
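
The crosstab above is normalized within each severity level, so it largely mirrors the overall male/female mix. A complementary view is the share of each severity level within each sex. The sketch below reconstructs the counts by multiplying the reported proportions by the severity totals from the Severity value counts and rounding, so treat it as an approximation.

```python
import pandas as pd

# Severity totals from the Severity value_counts() output
totals = pd.Series({"Fatal": 158, "Serious": 1743, "Slight": 10415})

# Row-normalized proportions copied from the crosstab above
props = pd.DataFrame(
    {"Female": [0.031646, 0.059667, 0.056841], "Male": [0.968354, 0.940333, 0.943159]},
    index=["Fatal", "Serious", "Slight"],
)

# Reconstruct counts (proportion x row total), rounded to whole accidents
counts = props.mul(totals, axis=0).round().astype(int)

# Share of each severity level within each sex (each column sums to 1)
within_sex = counts / counts.sum(axis=0)

print(counts)
print(within_sex.round(4))
```

On the notebook's DataFrame, `pd.crosstab(df['Severity'], df['sex'], normalize='columns')` should give this view directly, without the reconstruction step.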

Relation¶

In [43]:
df.relation = df.relation.replace({'Other':'Unknown'})
In [44]:
df1 = crossdf('relation')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Severity  3 non-null      object 
 1   Employee  3 non-null      float64
 2   Owner     3 non-null      float64
 3   Unknown   3 non-null      float64
dtypes: float64(3), object(1)
memory usage: 228.0+ bytes
In [45]:
data = melt(df1,'Relation')
data
Out[45]:
Severity Relation Proportion
0 Fatal Employee 0.721519
1 Serious Employee 0.778543
2 Slight Employee 0.783101
3 Fatal Owner 0.215190
4 Serious Owner 0.164659
5 Slight Owner 0.158617
In [46]:
viz(data,'Relation')
  • For employees, the proportion involved in accidents broadly follows their overall share of the data, falling from about 78% of slight accidents to 72% of fatal ones. So while employees are involved in more accidents than any other category, their share decreases as severity increases. Vehicle owners show the opposite pattern: their share rises from about 16% of slight accidents to 22% of fatal ones. This is what we know so far, and we will continue to investigate further.

Experience¶

In [47]:
df.experience = df.experience.replace({4:'5-10yr',3:'2-5yr',5:'Above 10yr',2:'1-2yr',
                                   1:'Below 1yr'})

In [48]:
df1 = crossdf('experience')
data = melt(df1,'Experience')
data
Out[48]:
Severity Experience Proportion
0 Fatal 1-2yr 0.132911
1 Serious 1-2yr 0.130809
2 Slight 1-2yr 0.144695
3 Fatal 2-5yr 0.291139
4 Serious 2-5yr 0.218589
5 Slight 2-5yr 0.209890
6 Fatal 5-10yr 0.259494
7 Serious 5-10yr 0.265060
8 Slight 5-10yr 0.274604
9 Fatal Above 10yr 0.183544
10 Serious Above 10yr 0.185313
11 Slight Above 10yr 0.183389
12 Fatal Below 1yr 0.044304
13 Serious Below 1yr 0.118761
14 Slight Below 1yr 0.108305
15 Fatal No Licence 0.000000
16 Serious No Licence 0.007458
17 Slight No Licence 0.010082
In [49]:
viz(data,'Experience')
  • Each severity level broadly reflects the experience distribution we observed earlier, with 5-10 years being the most common, followed by 2-5 years and then over 10 years of experience. However, an interesting pattern emerges in fatal accidents, where 2-5 years of experience is the most prevalent (about 29%). This suggests, however slightly, that fewer years of experience may have an impact on the severity of accidents, though with only 158 fatal accidents the evidence is limited.

Light¶

In [50]:
df.light = df.light.replace({'Darkness - lights unlit':'Darkness','Darkness - no lighting':'Darkness',
                             'Darkness - lights lit':'Darkness'})

In [51]:
df1 = crossdf('light')
data = melt(df1,'Light Condition')
data
Out[51]:
Severity Light Condition Proportion
0 Fatal Darkness 0.449367
1 Serious Darkness 0.298910
2 Slight Darkness 0.280941
3 Fatal Daylight 0.550633
4 Serious Daylight 0.701090
5 Slight Daylight 0.719059
In [52]:
viz(data,'Light Condition')
  • Our plot tells a clear story: as accident severity worsens, the proportion of accidents occurring in daylight decreases (from about 72% of slight accidents to 55% of fatal ones), while the proportion occurring in darkness rises correspondingly (from about 28% to 45%).
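
The same pattern can be expressed as a fatality share per light condition. This is a rough sketch: the fatal counts are reconstructed from the proportions in the table above and the total of 158 fatal accidents, and the per-condition totals come from the earlier light value counts.

```python
# Accidents per light condition, from the earlier value_counts() output
daylight_total = 8798
darkness_total = 3286 + 192 + 40  # the three darkness categories merged above

# Fatal accidents per condition: reported proportion x 158 fatal accidents, rounded
fatal_total = 158
fatal_darkness = round(0.449367 * fatal_total)
fatal_daylight = round(0.550633 * fatal_total)

# Share of accidents under each condition that were fatal
fatal_rate_darkness = fatal_darkness / darkness_total
fatal_rate_daylight = fatal_daylight / daylight_total

print(f"Darkness: {fatal_rate_darkness:.2%} fatal")
print(f"Daylight: {fatal_rate_daylight:.2%} fatal")
print(f"Ratio: {fatal_rate_darkness / fatal_rate_daylight:.2f}")
```

Read this way, an accident in darkness was roughly twice as likely to be recorded as fatal as one in daylight, which says nothing yet about why.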

Type of Collision¶

In [53]:
df1 = crossdf('tcollision')
df1
Out[53]:
tcollision Severity Collision with animals Collision with pedestrians Collision with roadside objects Collision with roadside-parked vehicles Fall from vehicles Other Rollover Unknown Vehicle with vehicle collision With Train
0 Fatal 0.012658 0.139241 0.151899 0.000000 0.000000 0.000000 0.025316 0.012658 0.658228 0.000000
1 Serious 0.015491 0.080895 0.156053 0.002869 0.002295 0.001721 0.030981 0.018359 0.690189 0.001147
2 Slight 0.013634 0.070379 0.143063 0.004705 0.002880 0.002208 0.032549 0.012962 0.716947 0.000672
In [54]:
df1.drop(columns=['Other'],inplace=True)
data = melt(df1,'Type of Collision')
data
Out[54]:
Severity Type of Collision Proportion
0 Fatal Collision with animals 0.012658
1 Serious Collision with animals 0.015491
2 Slight Collision with animals 0.013634
3 Fatal Collision with pedestrians 0.139241
4 Serious Collision with pedestrians 0.080895
5 Slight Collision with pedestrians 0.070379
6 Fatal Collision with roadside objects 0.151899
7 Serious Collision with roadside objects 0.156053
8 Slight Collision with roadside objects 0.143063
9 Fatal Collision with roadside-parked vehicles 0.000000
10 Serious Collision with roadside-parked vehicles 0.002869
11 Slight Collision with roadside-parked vehicles 0.004705
12 Fatal Fall from vehicles 0.000000
13 Serious Fall from vehicles 0.002295
14 Slight Fall from vehicles 0.002880
15 Fatal Rollover 0.025316
16 Serious Rollover 0.030981
17 Slight Rollover 0.032549
18 Fatal Vehicle with vehicle collision 0.658228
19 Serious Vehicle with vehicle collision 0.690189
20 Slight Vehicle with vehicle collision 0.716947
21 Fatal With Train 0.000000
22 Serious With Train 0.001147
23 Slight With Train 0.000672
In [55]:
viz(data,'Type of Collision')
  • For collisions with roadside objects and with other vehicles, the distribution is relatively balanced across all levels of accident severity. Collisions with pedestrians are the exception: their share rises from about 7% of slight accidents to about 14% of fatal ones. Collisions with roadside-parked vehicles, collisions with trains, and falls from vehicles are absent from fatal accidents, but these types are rare overall, so this is only weak evidence that they are less likely to result in fatalities.
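
A quick back-of-the-envelope check helps judge how much to read into collision types that never appear among fatal accidents. The sketch below assumes, purely for illustration, that fatalities were spread across collision types in proportion to their overall frequency; the counts are copied from the earlier collision value counts.

```python
import pandas as pd

# Overall counts per collision type, from the earlier value_counts() output
rare_types = pd.Series({
    "Collision with roadside-parked vehicles": 54,
    "Fall from vehicles": 34,
    "With Train": 9,
})
total_accidents = 12316
fatal_total = 158

# Fatal accidents expected if fatalities followed overall frequency
expected_fatal = rare_types / total_accidents * fatal_total
print(expected_fatal.round(2))
```

Each expected count is below one, so observing zero fatal accidents for these types is unsurprising and tells us little about how dangerous they are.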

Insights & Conclusion¶

  1. Majority of Accidents by Road Type: Most accidents occur on two-way roads divided only by broken (dashed) line markings and on undivided two-way roads, which together account for about 66% of the data.

  2. Vehicle Relation and Accident Severity: For employees, the proportion involved in accidents generally follows the overall trend, with a decrease from 78% to 72% as severity rises, indicating they are involved in more accidents than other categories but less so in severe ones. In contrast, vehicle owners show an increasing proportion with higher severity.

  3. Experience and Accident Severity: The distribution of accident severity generally mirrors experience levels, with 5-10 years of experience being most common. Notably, 2-5 years of experience is prevalent in fatal accidents, suggesting that fewer years of experience may increase the severity of accidents.

  4. Time of Day and Severity: The proportion of accidents occurring during the day decreases with increasing severity, while accidents in darkness increase significantly as severity rises.

  5. Collision Types and Severity: Collisions with roadside objects and with other vehicles show a balanced distribution across severity levels, while the share of collisions with pedestrians roughly doubles from slight to fatal accidents. Collisions with roadside-parked vehicles, collisions with trains, and falls from vehicles do not appear among fatal accidents, although these types are rare overall.